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Abstract 

Background: The analysis of complex diseases is an important problem in human genetics. Because multifactoriality 
is expected to play a pivotal role, many studies are currently focused on collecting information on the genetic and 
environmental factors that potentially influence these diseases. However, there is still a lack of efficient and thoroughly 
tested statistical models that can be used to identify implicated features and their interactions. Simulations using large 
biologically realistic data sets with known gene-gene and gene-environment interactions that influence the risk of a 
complex disease are a convenient and useful way to assess the performance of statistical methods. 

Results: The Gene-Environment iNteraction Simulator 2 (GENS2) simulates interactions among two genetic and one 
environmental factor and also allows for epistatic interactions. GENS2 is based on data with realistic patterns of 
linkage disequilibrium, and imposes no limitations either on the number of individuals to be simulated or on number 
of non-predisposing genetic/environmental factors to be considered. The GENS2 tool is able to simulate 
gene-environment and gene-gene interactions. To make the Simulator more intuitive, the input parameters are 
expressed as standard epidemiological quantities. GENS2 is written in Python language and takes advantage of 
operators and modules provided by the simuPOP simulation environment. It can be used through a graphical or a 
command-line interface and is freely available from http://sourceforge.net/projects/gensim. The software is released 
under the GNU General Public License version 3.0. 

Conclusions: Data produced by GENS2 can be used as a benchmark for evaluating statistical tools designed for the 
identification of gene-gene and gene-environment interactions. 

Keywords: Gene-environment interaction, Computer simulation, Complex disease, Epistasis, Genetic, Genome-wide 
association study, Genetics, Population, SimuPOP, Linkage disequilibrium, Genomics 



Background 

Most of the common human diseases with high mortality 
rates (such as cancer, heart disease, obesity, diabetes, and 
several common psychiatric and neurological conditions) 
are classified as complex diseases [1,2]. By definition, a 
complex disease is a multifactorial complex trait generally 
caused by multiple predisposing loci (possibly interact- 
ing) and by the exposure to particular environmental 
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factors [3]. Although several genetic and environmental 
factors have been shown to affect susceptibility to partic- 
ular complex diseases, the intricate sets of relationships 
between these factors and disease susceptibility are not 
yet exhaustively understood. For this reason, typically, 
the proportion of risk accountable to genetics and envi- 
ronment remains mostly unpredictable [4], Explanations 
for their unpredictability might include the occurrence of 
still unidentified factors and/or the presence of non-linear 
interactions among already identified factors; for example, 
some combinations of genetic and environmental factors 
could have disease risks that are consistently higher than 
those predicted by a single component. 
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Gene-environment interactions (GxE) are expected to 
influence complex phenotypes, for example, disease risk. 
Hence individuals with predisposing genetics are more 
likely to develop a disease when exposed to a damaging 
environment than individuals, exposed to the same envi- 
ronment, without predisposing genetics [5,6]. The role 
of GxE is so relevant that it is generally accepted that 
neglecting them can lead to an underestimation of dis- 
ease risk, and may explain some of the inconsistencies in 
replications in different studies [7] . 

Complex phenotypes are regulated by pathways and 
biochemical mechanisms that involve many genetic prod- 
ucts. Hence, in addition to interactions among genes and 
environment, interactions among different genetic loci 
(GxG) can also influence disease risk. In particular, GxG 
are defined as epistatic when the allelic variations of one 
gene alters the effect of variations of another gene [8]. 
Epistasis has been identified in human diseases [9,10], and 
its role in public health has been highlighted [8] . 

Surprisingly, despite the general agreement on the rel- 
evance of GxE and GxG for correct disease risk estima- 
tions, only a few epidemiological studies have attempted 
to identify them. Indeed, studying the complex interac- 
tions among risk factors is a daunting task that requires 
large data sets and specific research designs. Furthermore, 
the best statistical method for the identification of G x G 
and GxE in case-control data sets [11,12] is still widely 
debated. The performance of statistical methods that are 
used for the identification of GxG and GxE are typi- 
cally influenced by many factors: sample size, number of 
involved factors, type of interaction, model of inheritance, 
allelic frequencies, distributions of the environmental fac- 
tors, and relative strength of different factors affecting 
disease risk. Unfortunately, only a few of these features 
are generally assessable in real populations. A further lim- 
itation of the epidemiological studies that have been per- 
formed so far, is the limited knowledge about the impact 
of linkage disequilibrium (LD) on association statistics in 
the presence of GxG and GxE. It has been demonstrated 
that when GxG occur and the assayed SNPs are not the 
functional ones but SNPs that are in LD with them, com- 
mon statistics like r 2 are generally inappropriate and tend 
to lead to an over/underestimation of disease risk [13]. 

A possible strategy to assess the performances of statis- 
tical methods is to test them against simulated data sets 
where the relevant features influencing the disease risk are 
known (for a review of genetic simulators see [14] and the 
North Shore LIJ Research Institute List of Genetic Analy- 
sis Software [15]). With this aim, some of the authors [14] 
of the present work proposed a novel approach to simulate 
case-control samples based on: 

1. a Multi-Logistic Model (MLM) that can model any 
type of GxG and GxE, 



2. a mathematical approach (Knowledge Aided 
Parameterization System, KAPS) that can translate 
biological and epidemiological information to MLM 
parameters, and 

3. GENS (Gene Environment iNteraction Simulator), a 
software that produces simulated data sets. 

Using that approach interactions between one genetic and 
one environmental factor only could be simulated; there- 
fore, it was not possible to account for epistatic GxG. 
Moreover, all simulated loci were considered to be inde- 
pendent and thus it was not possible to account for LD 
[16]. 

In the present paper, we describe an extension of the 
previous model that overcomes such limitations using 
a new strategy that simulates up to two-genes x one- 
environment interactions with the possible inclusion of 
epistasis. Importantly, the present algorithm can be eas- 
ily extended to manage more than two genetic and one 
environmental factors. However, to simplify the design of 
biologically meaningful interactions, we limited the num- 
ber of features (see the Discussion section for details). 
Furthermore, the inclusion of two genetic factors (with 
epistatic interaction) that in turn interact with a con- 
tinuous environmental factor heavily increased of the 
complexity of the model. Indeed, statistical methods that 
can deal with even two genetic factors are still far from 
being functionally useful for real, large data sets [17]. 
To provide a realistic genetic background to the simu- 
lated populations, we implemented our extended model 
as a module which can be integrated with simuPOP, a 
forward-time populations simulator that reproduces real- 
istic demographic and evolutionary features [18]. 

Implementation 

GENS2 workflow 

Figure 1 depicts the GENS2 algorithm flowchart that 
is used to generate case/control synthetic populations 
starting from a set of desired epidemiological parame- 
ters (Table 1). The simulation procedure has two main 
branches: the definition of genetic and environmental 
information for each individual (left side of Figure 1), and 
the translation of user desired epidemiological parameters 
along with GxE and GxG models into the correspond- 
ing MLM parameters (right side of Figure 1). The two 
branches merge in the last step of the procedure, where 
disease status is assigned to each individual. In the fol- 
lowing sections we will describe the three parts of the 
algorithm in detail, emphasizing the main advances of the 
new software over the previous one. 

Generation of the synthetic data set 

The generation of the starting sample is carried out by a 
series of simuPOP scripts [19] that 
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Epidemiological 


parameters 


• Number of individuals 


• Expected prevalence 


• Allelic frequencies 


• Genetic risk 


• Distribution of environment 


• Environmental risk 


• HapMap population 


• GxE interaction 




• GxG interaction (epistasis) 



Genetic data simulation 
(simuPOP) 



Parameters generation 
(KAPS) 



Environmental data simulation 



Risk function generation 
(MLM) 



Phenotype assignment 



Simulated sample 



Figure 1 GENS2 workflow. Chart of the steps that were used to 
simulate a complex disease in a population using the simuPOP and 
GENS2 systems. 



• download phased genomic data from the HapMap 
public database [20], 

• select a subset of SNPs or entire genomic regions, and 

• let the population evolve until it reaches the desired 
size and frequencies for some disease predisposing 
loci (DPLs). 

To obtain a synthetic data set, simuPOP drives a forward- 
time simulation to obtain a population that is composed of 
the desired number of individuals and genotypic frequen- 
cies for all the markers. The use of this simulator helps 
to retain genetic realism in the final population, in par- 
ticular with respect to the patterns of LD (for a detailed 
description of this process, please see [19]). When the 
genetic information for each individual has been obtained, 
GENS2 assigns environmental exposures following a user 
denned Gaussian distribution for the disease environ- 
mental variable, and several other user defined distribu- 
tions (Gaussian, Uniform or Binomial) for a number of 
other environmental variables not related with the disease 
(environmental-confusing variables) . 

Definition of the penetrance model 

The second branch of the simulation procedure (right 
side of Figure 1) is implemented in the Knowledge Aided 
Parameterization System 2 (KAPS2) subsystem which 
accepts the input of some standard epidemiological mea- 
sures for the relevant features listed here: 

• the expected prevalence of the disease in the sample, 

• the id in the input data set of one or two DPLs, 

• the allelic frequencies of DPLs (calculated 
automatically from the input population), 



• the effect on disease risk of each DPL in terms of the 
relative risk of the high risk homozygote compared 
with the other homozygote, 

• the dominance relation of each DPL (W), expressed 
as a number in the interval [0 — 1], representing the 
dominance relation (W=0 dominant, W=l recessive, 
0<W<1 co-dominant), and 

• the distribution parameters and the effect of the 
environmental factor on disease risk, expressed as 
odds ratio (OR) of the risk related to one-unit 
increase in the exposure. 

KAPS2 also requires GxE and GxG models when two 
DPLs are provided. In particular: 

i) GxE models are chosen from a set of four predefined 
models, two models of interaction between DPLs and 
the environment, and two special models in which 
there is no gene-environment interaction but in 
which only one genetic or environmental factor 
contributes to the disease risk (see Table 2). 

ii) GxG models (epistasis) are accepted in the form of 
percentage variations on the risk associated with a 
maximum number of three (out of the possible nine) 
combined genotypes. 

KAPS2 converts population features and GxE and GxG 
models into the corresponding parameters of the MLM in 
two steps. First, starting from the provided epidemiologi- 
cal parameters, KAPS2 calculates the penetrance of each 
combined genotype assuming no interaction between the 
genotypes of each locus. Epistasis (if defined) is then mod- 
eled through a deformation procedure, reflecting GxG 
variations, of the set of penetrance values that keeps it 
coherent with user defined epidemiological parameters. 
In this step, when there is more than one way to change 
the values of the set (i.e. less than three epistatic varia- 
tions are provided), a mathematical optimization system 
is employed to find the deformation characterized by the 
smallest variation on the values not constrained by user 
defined epistatic variations. An example of the results 
of the epistasis application is presented in Figure 2. In 
particular, the figure shows the disease penetrance for 
each combined genotype before (left panel) and after 
(right panel) the penetrance of one combined genotype 
(namely (3,3)) has increased by 20%. Thus, by following 
the procedure presented above, the variations in the dis- 
ease penetrance values of other combined genotypes are 
automatically distributed. 

Consequently, for each combined genotype, KAPS2 
computes the coefficients of a penetrance function of the 
environmental exposure that is associated with the com- 
bined genotype in the MLM. In this step GxE are also 
modeled; the Additive model (ADD) assumes that com- 
bined genotypes with higher penetrance have a higher 
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Table 1 The epidemiologic parameters that were used for the sample simulation 



Task 


Required parameters 


Description 


SimuPOP 


1) Starting data (Hap Map) 


Chromosomes, or chromosome regions, or markers 
and marker distance. 


The genomic regions containing the loci that will be 
simulated 




Population (ethnicity) 


The starting frequency and linkage data to be used 
in the sumulation 


2) Simulation of sample's genetic data 


DPLs (Disease Predisposing Loci) 


Loci that will influence the disease risk. 




Target allelic frequency 


Final allelic frequencies at the end of simuPOP sim- 
ulation 




Final sample size 


Number of individuals that compose the population 
by simuPOP 


GENS2 


Starting sample 


simuPOP generated sample 


Sample data generated with simuPOP 




Disease prevalence 


The expected disease prevalence in the whole sam- 
ple 


Environment 


Environmental factor distribution 


Distribution of the environmental exposure in the 
whole sample 




Environmental factor OR 


Odds ratio associated with one-unit-increase of the 
environmental exposure 




Noisy Environmental variables 


As many as desired confounding environmental 
exposures not associated with the disease risk 
(gaussian, binomial or uniform distributed) 


Genetics 


DPLs 


These are the same DPLs as selected in the simuPOP 
simulation 




High risk alleles 


The allele, for each DPL, associated with the highest 
disease risk 




DPLs genotypic RR 


The relative risk of the high risk homozygote versus 
low risk homozygote, for each DPL 




Dominance 


The relationship of the risk associated with the het- 
erozygote with that associated with the homozy- 
gotes (recessive, dominant, codominant) 




Epistasis model (GxG) 


Percent increase of the risk associated with each 
combined genotype 


Gene Environment interaction 


GxE model 


One of the four predefined interaction models 



basal disease risk, while the risk associated with the envi- 
ronmental factor is just added. On the other hand, in 
the modulative model (GEM), combined genotypes with 



Table 2 Predefined gene-environment interaction models 
in GENS2 



Interaction 


Description 


model 




Genetic Model 


Disease risk depends only on the genetics of 


(GEN) 


an individual 


Environmental 


Disease risk depends only on environmental 


Model (ENV) 


exposure of an individual 


Gene Environ 


The genetics modifies the effect of the envi- 


ment interaction 


ronment in modulating the disease risk 


Model (GEM) 


Additive Model 


The effects of environment and genetics are 


(ADD) 


independent and sums in modulating the 




disease risk 



higher penetrance have the same basal risk although they 
are more sensitive' to the effect of the environment (see 
the Methods section). 

Disease risk of an individual 

In the final step the two branches of the procedure 
(Figure 1) join. Once the genetics and the environmental 
exposure for each individual are given, its disease risk can 
be obtained by evaluating the penetrance function associ- 
ated with its combined genotype. The risk is then used to 
assign a disease status using a random process. 

Software 

To create simulated populations, we employed an existing 
tool, SimuPOP, and the implementation of the above 
described algorithm. Using SimuPOP it is possible to drive 
a forward-time simulation that results in a population 
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A No Epistasis B Epistasis 




Figure 2 Example of application of epistasis. Disease penetrance for combined genotypes before (left panel) and after (right panel) the 
application of an epistasis model with an increment of 20% of the risk associated with the (CC-TT) composed genotype. The x- and y- axes plot the 
reported genotypes of the two DPLs; the z-axis plots the risk associated with each combined genotype. 



composed by the desired number of individuals and hav- 
ing specified genotype frequencies for a set of markers. 
To be usable in GENS2, populations should be created 
in SimuPOP as described previously [19]. This procedure 
generates a data set that contains genotypic data as a set 
of bi-allelic loci, for each individual in the population. 

GENS2 accepts as input a population produced by 
SimuPOP and the ids of the DPLs chosen from those 
present in the data set. For each DPL, the penetrance 
values are provided as relative risk (RR) and domi- 
nance model (see [14]). RR is calculated as the risk 
ratio of the high over the low risk homozygotes. For 
each combination of DPL genotypes, the percentage vari- 
ation of the penetrance value if epistasis occurs can 
be assigned. GENS2 assigns environmental factor expo- 
sure on the basis of a user-defined Gaussian distribu- 
tion. Any remaining non-implicated loci can be consid- 
ered as background noise. Similarly, GENS2 can produce 
an arbitrary number of non-associated random envi- 
ronmental exposures, following a Gaussian, uniform or 
binomial distribution, that act as further background 
noise. 

On the basis of the selected type of GxG and 
GxE, GENS2 calculates the coefficients of the MLM as 
described in the Method section. 

For each individual, GENS2 assigns the disease status 
(affected or unaffected) on the basis of its disease risk by 
applying the MLM and using a random process. 

The main output of the software can be either a sin- 
gle file or several files for a set of subpopulations of a 
given size produced by means of a subsampling proce- 
dure. Subsampling allows bootstrapping procedures to 
be executed on data sets produced with the same fea- 
tures. The output of GENS2 is in the form of a table in 
which each row represents an individual and the columns 
contain, from the left to the right, disease status, gen- 
der, environmental exposures and genotypes for each 
individual. 



Two possible formats for the genetics output are avail- 
able: phased haplotypes or genotypes. In both output for- 
mats the initial columns are identical to those described 
above; however, they differ in the way the genetic infor- 
mation for each individual is represented. In the phased 
haplotype format, there are two columns for each SNP 
that report the allele status (either A, C, T or G) on each 
chromosome. In the genotype format, each SNP is rep- 
resented by one number (1, 2 or 3), where 2 represents 
the heterozygote and 3 represents, for DLPs, the high risk 
homozygote or, for all the other SNPs, the lower frequency 
homozygote. 

In addition to the main output file, GENS2 outputs a 
log file that contains an extensive report of all the inter- 
mediate steps in the procedure and the values used to 
obtain the populations. Optionally, a file containing the ID 
and genomic position of the SNPs in the data set can be 
returned. 

GENS2 is designed to be used either from the command 
line as a Python script, or through a graphical user inter- 
face, similar to a wizard, that prompts the user in the spec- 
ification of all required parameters [see Additional file 1]. 

Overall, the computational time complexity of the sim- 
ulation procedure depends by both simuPOP and GENS2. 
For GENS2, the procedure is dominated by the assign- 
ment of the disease status to all individuals in the pop- 
ulation. Indeed, after the KAPS2 module has performed 
the translation of user provided parameters into MLM 
parameters in bounded constant time, the time complex- 
ity becomes linear in the number of individuals and the 
number of represented variables (genotypes and envi- 
ronmental exposures) for each individual in the simu- 
lated population. On the other hand, the amount of time 
required to perform a simulation with simuPOP depends 
on the size of the simulation and scales roughly linearly 
with the number of markers and individuals that are 
used [19]. GENS2 is written in Python and is completely 
open-source. The software is freely available from [21]. 
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Results and discussion 

Here we describe a method based on the MLM to simu- 
late two genetic and one environmental factors interacting 
in the determination of a disease risk. The method is 
implemented in GENS2, a software that is freely available. 

To test populations produced by GENS2, we performed 
a set of analyses on some representative populations. The 
aim was to emulate a case in which GENS2 was used to 
assess the performances of a feature selection method. In 
particular, all the analyses were performed using a logis- 
tic regression (glm function in R) with a different model 
for each test and considering the status as the dependent 
variable. 

The first test was a single-marker analysis on a popu- 
lation of 1,000 cases and 1,000 controls with two DPLs 
in two distinct genomic regions, with no epistasis and an 
additive GxE model. The association of each marker with 
the status was tested using logistic regression analyses 
with model: disease risk = genetic factor + environmental 
factor. As expected, the most significant associations were 
those of DPLs [see Additional file 2]. The result showed 
that the environmental variable was also associated with 
the disease (p < 10 -6 ). Furthermore, non-causative mark- 
ers in LD with the two DPLs also showed a significant 
association that was roughly proportional to the value of 
r 2 with the DPLs. 

The second test was similar to the first, except that 
10,000 cases and 10,000 controls and a modulative GxE 
model for the DPLs were used. For this test, the logis- 
tic regression was used by considering both an additive 
model (disease risk = genetic factor + environmental fac- 
tor) and a multiplicative model (disease risk = genetic 
factor * environmental factor). None of the markers, when 
tested by additive model, reached a Bonferroni corrected 
significance level [see Additional file 3, middle panel]. 
Conversely, DPLs were found to be significant when 
the multiplicative model was explicitly considered. Non- 
causative markers in LD with them were also found to be 
significant. Notably, this more complex model required a 
10-fold increase in the sample size to achieve the same 
significance level as the previous test. 

Finally, we tested an example of two DPLs with no 
marginal risk, an epistatic interaction (+20% penetrance 
for the (3,3) combined genotype) and an additive GxE 
model in a population of 5,000 cases and 5,000 controls. 
Because of the higher computational cost of this analy- 
sis, we performed the test on only a subset of about 1,200 
markers surrounding the two DPLs. The results are dis- 
played in Figure 3. The top panel shows the results of 
a single-marker analysis. As expected, no markers were 
found to be significantly associated when tested individ- 
ually. Thus, all possible 2-markers interactions (bottom 
panel) were tested. Only the gene-gene interactions of 
DPLs and of markers in strong LD with them were found 



to be significantly associated with the status after a Bon- 
ferroni correction (red dots). 

The model described here can handle, in principle, any 
number of DPLs and environmental variables. However, 
we chose to limit the implementation to a relatively small 
number of factors (two genetic and one environmental) 
so that setting up the model does not become too com- 
plicated for the user. In this way, we reached a balance 
between the complexity of the represented phenomena 
and simplicity in the definition of the model. Moreover, 
the best strategy to identify even simple interactions as 
single GxG and GxE with binary environmental vari- 
ables it is still debated (for an example of the debate, 
see the report on the 2009 Genetic Analysis Workshop 
[11,12,17]). For this reason, we believe that a set of simu- 
lated populations in which all features are known provides 
an important tool for the identification of the best strategy 
to identify and study GxG and GxE. 

Several methods simulating genetic data have been pro- 
posed, many of them also handle complex LD patterns 
and polygenic traits [22,23]. For example, HAPGEN2 [22] 
simulates multiple polymorphic loci that are in LD and 
in the same chromosomal region. HAPGEN2 can simu- 
late GxG, including epistasis, between two loci; however, 
the available models are limited to a predefined set. More- 
over, this program cannot handle GxE and the number 
of cases and controls that are produced cannot be con- 
trolled. Another tool, gs [23], similarly simulates multiple 
loci with a realistic pattern of LD; it can modulate a quan- 
titative trait (as disease risk) and can also manage GxG 
and GxE. In gs the user can define GxG between two 
loci in two ways: one, by providing a penetrance matrix 
for combined genotypes or, two, by selecting a penetrance 
matrix from a predefined set of GxG models. However, 
both these approaches have some limitations. Although 
the first allows great control over the penetrance of each 
genotype, it easily leads to a loosening of control on 
marginal effects, making the replication of real popula- 
tions difficult. The second approach, on the contrary, is 
too restrictive and does not allow any possible interac- 
tion to be simulated. For simulation of GxG, gs allows 
the user to input a list of rules regarding specific combi- 
nations of genotypes and levels of environmental values, 
and the corresponding risk levels. Again, this approach 
makes it very difficult to control the overall characteristics 
of populations in terms of marginal effects of genetic and 
environmental factors. The approaches described above 
can simulate complex interactions by loosening control 
on overall population characteristics or, alternatively, can 
keep the overall population characteristics under con- 
trol by limiting the freedom to specify the interactions. 
Another strategy to simulate GxG and GxE is to man- 
ually write functions or sets of rules that associate each 
combination of genes and environmental factors to a risk 
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Figure 3 Association test for the case of epistatic interaction. The population comprised 5,000 cases and 5,000 controls. Two DPLs with no 
marginal risk(RR=1), an epistatic interaction (+20% penetrance for the (3,3) combined genotype) and an additive GxE model (odds ratio (OR)=1.2) 
were used. The two DPLs are in two distinct genomic regions (Chr 8: 1 1 7,948,1 82 - 1 1 9,256,695 in yellow; Chr 1 0: 1 1 4,408,939 - 1 1 5,256,799 in cyan). In 
the upper panel, the Manhattan plot shows the significance of the association (- log 10 (p-value)) of each marker when tested individually (each dot 
represents a different marker). The red dashed line represents the significance threshold (0.05 after Bonferroni correction) and the green dashed lines 
mark the position of DPLs. In the middle panel, the r 2 of each marker with the DPL in the same region is shown. In the bottom panel, the significance 
of the association for each 2-loci interaction (grey scale, nonsignificant; red scale, significant at a 0.05 level after Bonferroni correction) is shown. 



value. Although this strategy provides more freedom, it 
is very difficult to set up when the control of marginal 
effects of single factors is desired. We believe that GENS2 
provides a better balance between the freedom to define 
possible interactions among factors and the control of the 
overall population characteristics. 

The simulated populations produced with GENS2 can 
be thought of as a sampling of an ideal infinite popula- 
tion that has the characteristics specified by the user. From 
this point of view, it is easy to understand that fluctua- 
tions of observed values around the expected ones can 
occur. Among the elements that mostly affect these fluctu- 
ations, are sample size, allele frequencies, and penetrance 
values. In particular, small sample sizes increase the effect 
of sampling error and thus, as expected, these fluctua- 
tions tend to vanish as the sample size is increased [see 
Additional file 4]. This property can be used to test statis- 
tical methods because it allows the user to assess how well 



the methods perform as population and sample features 
change; conversely, it helps to assess the statistical power 
of methods in the presence of population features that can 
only be presumed. 

Although the GENS2 part of the simulation process is 
reasonably fast, the procedure to simulate large popula- 
tions using simuPOP takes time to complete. It would be 
difficult to simulate a large number of samples without a 
cluster system, unless multiple (small) samples are drawn 
from the same large population. 

Conclusions 

GENS2 allows the simulation of gene-gene and gene- 
environment interactions among two genetic and one 
environmental factor in relation to the risk to develop a 
complex disease. It is based on data with a realistic pat- 
tern of LD and it has no limitations either on the number 
of individuals that can be simulated or on the number 
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of genetic and environmental factors within a simulated 
data set. Furthermore, a large amount of effort has been 
channeled into allowing the input of parameters as stan- 
dard epidemiological quantities so that the software is 
immediately usable by the biomedical community. 

GENS2 provides large biologically realistic data sets 
with known features that can be used to challenge, and 
eventually improve, the statistical tools that are designed 
to identify those interactions. 

Methods 

Here we present the mathematical background underly- 
ing the extension of the earlier model [14] to the case of 
two (possibly interacting) DPLs. For simplicity, we have 
described the case of two DPLs and one environmental 
factor. Thus, we can generally assume that: 

1. the genetics can influence the disease risk either 
directly or by modifying the effect of the 
environment. 

2. the genetic loci can have independent effects (no 
epistasis) or can interact in an epistatic manner, and 

3. the DPLs are not in LD. 

The Multi-Logistic Model 

To model these situations we applied the MLM, here 
briefly summarized, that uses a different logistic func- 
tion for each combination of the two genotypes [14]. The 
dependent variable of the functions is the disease risk 
while the independent variable is the environmental expo- 
sure. For diploid loci, denoted by A and B two disease 
predisposing alleles, there are three distinct genotypes for 
each locus, namely AA, Aa, aa and BB, Bb, bb. For an indi- 
vidual carrying a combined genotype (g a >gb) (with g a e 
GA = {AA.Aa.aa} and g b e GB = {BB,Bb,bb}) who 
is exposed to the environmental level x, the disease risk 
is defined under the MLM by the conditional probability 
P (affected \g a >gb> x ) > which is parameterized as: 



P (affected \g a ,gb*x) = [l+exp{a^) + p(g afgb) x}] 1 

(1) 



where ot{g at g b ) and P(g a ,g b ) are free parameters determined 
by the genetic factors that determine the shape of the 
function. 

To simulate a population, the coefficients oi(g a ,g b ) and 
P(ga>gb) °f tne l°gi st i c functions that produce the desired 
population features have to be determined. This task is 
performed by the KAPS2 module that finds the corre- 
sponding MLM coefficients by considering all the desired 
population features within a set of biological constraints. 



Determination of MLM parameters 

Let Pg a and P gb be the genotypic frequencies for variables 
A and B and let m be the prevalence of the disease. Starting 
from these values and using the independence hypothe- 
sis for the variables A and B, the probability P(g a ,g b ) for 
an individual to carry the genotype (g a >gb) is the prod- 
uct P ga P gb . If for each combined genotype (g a >gb) the total 
risk for the disease insurgence TR(g a>gb ) is known, then this 
parameter represents the probability for an individual to 
be affected given the carried genotype (g a >gb)- The value 
of this parameter is obtained with the MLM as 



P(affected\g a ,gb)= / ft 
Jx 1 



P E (x) 



dx 



x[l + eM a (ga,gb)+P(ga,g b ) X }] 

- TR(g a ,g b ) (2) 

where X is the domain of the environmental variable. 

Because every logistic function in MLM is character- 
ized by its own parameters, the 3x3 pairs of values 
( a (ga,gb)> P(ga,gb)^ that satisfy the constraints expressed by 
equation (2) need to be found. 

Modeling GxE 

In general, equation (2) admits infinite solutions. How- 
ever, the GxE model imposes some constraints on the 
coefficients. Thus, by fixing the value of one of the coef- 
ficients P(g a ,g b ), hereafter denoted as /3ab> the number of 
degrees of freedom of the system can be reduced, draw- 
ing one solution from the equation system. By convention, 
we chose to associate Pab to the genotype with highest 
risk; it is easy to show that this value corresponds to the 
natural logarithm of the odds ratio of the risk which is 
related to the increase of one unit of the environmental 
exposure. Constraints imposed on the system by each one 
of the proposed gene environment interactions model are 
summarized below: 

• Genetic effect (GEN): ^ 0 and ^ gb) = 0 
V(fo,ft) £ GA x G£and 

= °W,) V (ga,gb) > {gx>g y ) eGAx GB]. 

• Environmental effect (ENV): ot(g a ,g b ) = ot(gi,g k ) and 
P(ga,g b ) = P(gi,gk) = Pab^O 

V (ga,gb)>(ghgk) eGAxGB. 

• Modulative effect (GEM): d(g a ,g b ) = U(gi,g k ) and 
P(ga,g b ) rOV (g a ,g b ) , (g hgk ) eGAx GB. 

• Additive effect (ADD): oi{g aigb) ^ 0 and 
Piga,g b ) =Pab^0V (g a ,g b ) eGAx GB. 

When the interaction model, the matrix containing total 
risk values for each combination of genotypes, namely 
TR(g a ,gb)> an d the value for the coefficient Pab have been 
denned, a set of six transcendent equations can be written 
with the coefficients of the logistic functions (except /3ab) 
as the unknown variables; these equations admit exactly 
one solution [14]. 
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Modeling G x G 

To determine TR(g a>gb ), further assumptions concerning 
the role played by GxG are required. First, the values that 
are used need to be consistent with the provided disease 
prevalence m, namely 

P(affected) = E TR (g^g b ) P (ga,g b ) - ™ (3) 

gaeGAg b eGB 

Moreover, the total risk values associated with the geno- 
types of a single locus are related to those of combined 
genotypes via marginalization: 

P(affected| & ) = ^ E TR (ga,g b )P(g a ,g b ) - TR ga (4) 



l ga 



gb eGB 



P(affected|&) = ^ E TR (ta,g b )P(ga,g b ) = (5) 

In general, once the marginals I7? & and TR gb are given, 
there are infinite choices for the matrix TR(g atgb ) that sat- 
isfy the constraints imposed by Eq.s (4) and (5); each 
matrix is representative of a particular GxG model. Of 
the possible choices, the case of no epistasis represents 
a situation where TR^ a}gb ) is determined starting from 
the fixed values of TR ga and TR gb only. In such a model, 
genetic factors independently contribute to the probabil- 
ity of being affected and are conditionally independent 
given the disease status. Under these assumptions the fol- 
lowing relationship which satisfies the constraints of Eq.s 
(4) and (5) is easily obtained: 



TR{ 



(ga>g b ) 



m 



Notice that the superscript "I" is a reminder that the 
independent polygenic model has been assumed. 

Using an independent polygenic model and a deforma- 
tion procedure, epistatic interactions among DPLs can be 
modeled to obtain a matrix TR^^ (where superscript 
"E" stands for epistatic) that still complies with constraints 
(4) and (5). In this approach, epistasis is modeled as a 
departure from the independent polygenic model via a 
change (positive or negative) in one or more entries of 



TR 1 

1K (ga,g b Y 

Let A g R 



3x3 
-1,1] 



be a matrix with the same dimen- 
sionality as TR 1 , where each entry A(g aigb ) represents the 
variation of the element TR 1 , ^ as a result of the epistatic 
interaction of the combined genotypes (g af gb)* By defini- 

tion TRE (ga,g b ) = TRl iga,g b ) + A (fe**) and must satis fy the 
condition 



Using the expressions in Eq.s (4) and ( 5) we get 

g b eGB 

T,^ TRl (ga,g b ) + A ^g b )) R ga = TR gb (8) 

gaeGA 

Because by construction, the matrix TR 1 ^ ^ already sat- 
isfies the constraints (4) and (5), the two following consis- 
tence conditions can be derived, 



E A (ga£b) P gb = 0 

gbeGB 

E A (ga,g b ) R ga = 0 



(9) 
(10) 



gaeGA 



0<TRl )gb) <l V( M ) 



(6) 



Once the quantities TRg a , TR gb) P ga and P gb are given, the 
constraints (9) and (10) define a convex region in R 3x3 
in which the elements are assignments for the entries of 
matrix A(g aigb y The specification of an epistatic model 
is, therefore, made through the definition of an incre- 
ments matrix A(g a g b ) that complies with the constraints 
(9) and (10) and that also satisfies the positivity condition 
for TR^y It can be difficult for a user to specify such a 
matrix in a way that does not violate the above constraints. 
However, the number of entries of A(g a , gb ) that the user 
has to provide (paying attention to avoiding extreme or 
off-range values) can be reduced by letting the system find 
the remaining entries. 

More precisely (in the two variables case), given the con- 
straints of Eq.s (9) and (10) from one up to three entries 
for A(g a g b ) can be provided following the rule that any pair 
must lie on the same row or in the same column. If the user 
correctly provides three values, the system admits only 
one assignment for unspecified values of A(g a g b y, however, 
if the user provides less than three values, there are an 
infinite number of ways to choose the remaining entries 
of A(g a g b )* In such a case, instead of randomly choosing a 
solution, a solution that maximizes an "objective function" 
is chosen. The problem of fixing the remaining values of 
A (ga*gb) can ^ e represented as a continuous mathemat- 
ical programming problem with decision variables that 
are the non-user-provided entries of A(g a ,g b ) and whose 
admissible region can be determined by Eq.s (9), (10) and 
(6). 

An objective function can be used to minimize the 
variance of the set of ratios A^^/TR 1 ^ ^ computed 

V(ga>gb) corresponding to non-user-assigned A^^). 
Such a function is suitable for use in all situations in which 
the relationships between existing variables for which the 
user does not provide increments are to be maintained as 
far as is possible. 
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Establishing the disease status 

Once the coefficients of the MLM are fixed, the disease 
risk for each individual in a population can be established 
by substituting the coefficients associated with the carried 
genotype into Eq. (1) and then by evaluating the resulting 
logistic function forthe exposure level of the environ- 
mental disease factor. Finally, to assign the disease status 
to each individual, the disease risk is compared with a 
random number drawn from a uniform distribution. 

Availability and requirements 

Project name: Gene-Environment iNteraction Simulator 
2 

Project home page: http://sourceforge.net/projects/ 
gensim/ 

Operating system(s): Platform independent 
Programming language: Python 

Other requirements: SimuPop, OpenOpt, wxPython 
(optional) 

License: GNU GPLv3 



Additional files 



Additional file 1 : The GENS2 graphic user interface. Flowchart 
showing a typical way of using GENS2 through its graphical user interface. 
Portable Network Graphics (.png) image file. 

Additional file 2: Association test in the case of additive G x E. The 

population comprised 1 ,000 cases and 1 ,000 controls. Two DPLs (RR=1 .6, 
W=0.5) in an additive G x E model (OR=1 .2) with no epistatic interaction 
were present. The two DPLs are in two distinct genomic regions (Chr 8: 
1 1 5,755,575-1 20,750,91 3 in yellow; Chr 1 0: 1 1 2,253,020-1 1 7,247,095 in 
cyan). In the upper panel, the Manhattan plot shows the significance of the 
association (— log 10 (p-value)) of each marker when tested individually 
(each dot represents a different marker). The red dashed line represents the 
significance threshold (0.05 after Bonferroni correction) and the green 
dashed lines mark the position of the DPLs. In the bottom panel, the r 2 for 
each marker with the DPL in the same region is shown. Portable Network 
Graphics (.png) image file. 

Additional file 3: Association test in the case of modulative G x E. The 

population comprised 1 0,000 cases and 1 0,000 controls. Two DPLs (RR=1 .6, 
W=0.5) in a modulative GxE model (OR=1.2) with no epistatic interaction 
were present. The two DPLs are in two distinct genomic regions (Chr 8: 
1 1 5,755,575-1 20,750,91 3 in yellow; Chr 1 0: 1 1 2,253,020-1 1 7,247,095 in 
cyan). In the upper panel, the two Manhattan plots show the significance 
of the association (- log 10 (p-value)) of each marker when tested 
individually (each dot represents a different marker), using a multiplicative 
and an additive model in the logistic regression. The red dashed line 
represents the significance threshold (0.05 after Bonferroni correction) and 
the green dashed lines mark the position of DPLs. In the bottom panel, the 
r 2 of each marker with the DPL on the same region is shown. Portable 
Network Graphics (.png) image file. 

Additional file 4: Expected and observed penetrance values plotted 
for each combined genotype and for different sample sizes. In each of 
the panels one of the possible combined genotypes is shown. The 
genotypes (1 , 2, and 3) are ordered according to their predicted affect on 
the overall disease risk. The x-axes show the sample size and the y-axes 
show the risk. The green lines represent the expected risk, the blue lines 
show the median observed risk, and the red dashed lines indicate the 
minimum and maximum observed disease risk in 100 replicates. Portable 
Network Graphics (.png) image file. 
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